{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.11289,
     "end_time": "2020-06-23T19:09:28.704069",
     "exception": false,
     "start_time": "2020-06-23T19:09:28.591179",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Time Series Forecasting using the NOAA Weather Data of JFK Airport (New York)\n",
    "\n",
    "This notebook relates to the NOAA Weather Dataset - JFK Airport (New York). The dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature and wind speed) collected at JFK airport. This dataset can be obtained for free from the IBM Developer [Data Asset Exchange](https://developer.ibm.com/exchanges/data/all/jfk-weather-data/).\n",
    "\n",
    "In this notebook we explore approaches to predicting future temperatures by using the time-series dataset.\n",
    "\n",
    "### Table of Contents:\n",
    "* [1. Read the Cleaned Data](#cell1)\n",
    "* [2. Explore Baseline Models](#cell2)\n",
    "* [3. Train Statistical Time-series Analysis Models](#cell3)\n",
    "* [Authors](#cell4)\n",
    "\n",
    "#### Import required modules\n",
    "\n",
    "Import and configure the required modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:09:28.847867Z",
     "iopub.status.busy": "2020-06-23T19:09:28.818724Z",
     "iopub.status.idle": "2020-06-23T19:10:35.430435Z",
     "shell.execute_reply": "2020-06-23T19:10:35.429274Z"
    },
    "papermill": {
     "duration": 66.690879,
     "end_time": "2020-06-23T19:10:35.430701",
     "exception": false,
     "start_time": "2020-06-23T19:09:28.739822",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install statsmodels\n",
     "!pip install scikit-learn\n",
    "!pip install matplotlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:10:36.107462Z",
     "iopub.status.busy": "2020-06-23T19:10:36.106220Z",
     "iopub.status.idle": "2020-06-23T19:10:38.859278Z",
     "shell.execute_reply": "2020-06-23T19:10:38.860272Z"
    },
    "papermill": {
     "duration": 3.115861,
     "end_time": "2020-06-23T19:10:38.860579",
     "exception": false,
     "start_time": "2020-06-23T19:10:35.744718",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from statsmodels.tsa.statespace.sarimax import SARIMAX\n",
    "from matplotlib import pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.304667,
     "end_time": "2020-06-23T19:10:39.610917",
     "exception": false,
     "start_time": "2020-06-23T19:10:39.306250",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell1\"></a>\n",
    "\n",
    "### 1. Read the Cleaned Data\n",
    "\n",
    "We start by reading the cleaned dataset that was created in notebook `Part 1 - Data Cleaning`. \n",
    "\n",
     "**Note:** if you haven't yet run the `Part 1 - Data Cleaning` notebook, run it first; otherwise the cells below will not work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:10:40.252628Z",
     "iopub.status.busy": "2020-06-23T19:10:40.247924Z",
     "iopub.status.idle": "2020-06-23T19:10:42.107202Z",
     "shell.execute_reply": "2020-06-23T19:10:42.106129Z"
    },
    "papermill": {
     "duration": 2.190699,
     "end_time": "2020-06-23T19:10:42.107565",
     "exception": false,
     "start_time": "2020-06-23T19:10:39.916866",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "data = pd.read_csv('data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv', parse_dates=['DATE'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:10:42.780891Z",
     "iopub.status.busy": "2020-06-23T19:10:42.779724Z",
     "iopub.status.idle": "2020-06-23T19:10:44.067874Z",
     "shell.execute_reply": "2020-06-23T19:10:44.066877Z"
    },
    "papermill": {
     "duration": 1.586342,
     "end_time": "2020-06-23T19:10:44.068083",
     "exception": false,
     "start_time": "2020-06-23T19:10:42.481741",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Set date index\n",
    "data = data.set_index(pd.DatetimeIndex(data['DATE']))\n",
    "data.drop(['DATE'], axis=1, inplace=True)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.289625,
     "end_time": "2020-06-23T19:10:44.726625",
     "exception": false,
     "start_time": "2020-06-23T19:10:44.437000",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "For purposes of time-series modeling, we will restrict our analysis to a 2-year sample of the dataset to avoid overly long model-training times. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:10:45.327525Z",
     "iopub.status.busy": "2020-06-23T19:10:45.326317Z",
     "iopub.status.idle": "2020-06-23T19:10:45.343228Z",
     "shell.execute_reply": "2020-06-23T19:10:45.342339Z"
    },
    "papermill": {
     "duration": 0.321659,
     "end_time": "2020-06-23T19:10:45.343406",
     "exception": false,
     "start_time": "2020-06-23T19:10:45.021747",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "sample = data['2016-01-01':'2018-01-01']\n",
    "sample.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.293214,
     "end_time": "2020-06-23T19:10:45.944510",
     "exception": false,
     "start_time": "2020-06-23T19:10:45.651296",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### Create Training/Validation/Test Splits\n",
    "\n",
    "Before we attempt any time-series analysis and prediction, we should split the dataset into training, validation and test sets. We use a portion of the data for training, and a portion of _future_ data for our validation and test sets.\n",
    "\n",
     "If we instead trained a model on the full dataset, the model would learn to be very good at making predictions on that particular dataset, essentially just memorizing the answers. However, when presented with data it _has not seen_, the model would perform poorly, since it has not learned how to generalize.\n",
     "\n",
     "By training on one portion of the dataset and evaluating the model's performance on another portion (data the model has not seen during training), we try to avoid \"over-fitting\" the dataset and make our models better at predicting temperatures given unseen, future data. This process of splitting the dataset and evaluating a model's performance on the validation and test sets is commonly known as [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).\n",
    "\n",
    "By default here we use 80% of the data for the training set and 10% each for validation and test sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:10:46.594414Z",
     "iopub.status.busy": "2020-06-23T19:10:46.591764Z",
     "iopub.status.idle": "2020-06-23T19:10:46.617663Z",
     "shell.execute_reply": "2020-06-23T19:10:46.600939Z"
    },
    "papermill": {
     "duration": 0.348951,
     "end_time": "2020-06-23T19:10:46.620618",
     "exception": false,
     "start_time": "2020-06-23T19:10:46.271667",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "def split_data(data, val_size=0.1, test_size=0.1):\n",
    "    \"\"\"\n",
    "    Splits data to training, validation and testing parts\n",
    "    \"\"\"\n",
    "    ntest = int(round(len(data) * (1 - test_size)))\n",
    "    nval = int(round(len(data) * (1 - test_size - val_size)))\n",
    "\n",
    "    df_train, df_val, df_test = data.iloc[:nval], data.iloc[nval:ntest], data.iloc[ntest:]\n",
    "    \n",
    "    return df_train, df_val, df_test\n",
    "\n",
    "\n",
    "# Create data split\n",
    "df_train, df_val, df_test = split_data(sample)\n",
    "\n",
    "print('Total data size:      {} rows'.format(len(sample)))\n",
    "print('Training set size:    {} rows'.format(len(df_train)))\n",
    "print('Validation set size:  {} rows'.format(len(df_val)))\n",
    "print('Test set size:        {} rows'.format(len(df_test)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.414963,
     "end_time": "2020-06-23T19:10:47.400018",
     "exception": false,
     "start_time": "2020-06-23T19:10:46.985055",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell2\"></a>\n",
    "\n",
    "### 2. Explore Baseline Models\n",
    "\n",
    "In this section, we'll create a few simple predictive models of temperature, using shifting and rolling averages. These will serve as a baseline against which we can compare more sophisticated models.\n",
    "\n",
     "Using values at recent timesteps (such as the most recent timestep `t-1` and the second-most recent timestep `t-2`) to predict the current value at time `t` is known as persistence modeling: using the last observed value to predict the next one. These preceding timesteps are referred to in time-series analysis as `lags`, so the value at time `t-1` is known as the `1st lag` and the value at time `t-2` is the `2nd lag`.\n",
    "\n",
     "We can also create baselines based on rolling (or moving) averages: a series constructed by averaging the values over a trailing window of lags. For example, a 6-period (or 6-lag) rolling average is the average of the previous 6 hourly lags `t-6` to `t-1`.\n",
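     "\n",
     "As a quick toy illustration of the pandas operations used for these baselines (`shift` for lags and `rolling` for moving averages), on a made-up hourly series:\n",
     "\n",
     "```python\n",
     "import pandas as pd\n",
     "\n",
     "s = pd.Series([60.0, 62.0, 64.0, 66.0],\n",
     "              index=pd.date_range('2017-01-01', periods=4, freq='H'))\n",
     "print(s.shift(1))           # 1st lag: NaN, 60, 62, 64\n",
     "print(s.rolling(2).mean())  # 2-period rolling average: NaN, 61, 63, 65\n",
     "```\n",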
    "\n",
    "Our baseline models will be:\n",
    "1. `1st lag` - i.e. values at `t-1`\n",
    "2. `2nd lag` - i.e. values at `t-2`\n",
    "3. `6-lag` rolling average\n",
    "4. `12-lag` rolling average"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:10:48.228674Z",
     "iopub.status.busy": "2020-06-23T19:10:48.227122Z",
     "iopub.status.idle": "2020-06-23T19:10:48.275661Z",
     "shell.execute_reply": "2020-06-23T19:10:48.277404Z"
    },
    "papermill": {
     "duration": 0.47806,
     "end_time": "2020-06-23T19:10:48.277812",
     "exception": false,
     "start_time": "2020-06-23T19:10:47.799752",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# define the column containing the data we wish to model - in this case Dry Bulb Temperature (F)\n",
    "Y_COL = 'dry_bulb_temp_f'\n",
    "\n",
    "# Use shifting and rolling averages to predict Y_COL (t)\n",
    "n_in = 2\n",
    "n_out = 1\n",
    "features = [Y_COL]\n",
    "n_features = len(features)\n",
    "\n",
    "# create the baseline on the entire sample dataset.\n",
    "# we will evaluate the prediction error on the validation set\n",
    "baseline = sample[[Y_COL]].loc[:]\n",
    "baseline['{} (t-1)'.format(Y_COL)] = baseline[Y_COL].shift(1)\n",
    "baseline['{} (t-2)'.format(Y_COL)] = baseline[Y_COL].shift(2)\n",
     "# shift by 1 so each rolling-average prediction uses only past values (e.g. t-6 to t-1)\n",
     "baseline['{} (6hr rollavg)'.format(Y_COL)] = baseline[Y_COL].rolling('6H').mean().shift(1)\n",
     "baseline['{} (12hr rollavg)'.format(Y_COL)] = baseline[Y_COL].rolling('12H').mean().shift(1)\n",
    "baseline.dropna(inplace=True)\n",
    "baseline.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.346401,
     "end_time": "2020-06-23T19:10:48.951816",
     "exception": false,
     "start_time": "2020-06-23T19:10:48.605415",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
     "Next, we will plot data from our validation dataset to get a sense of how well these baseline models predict the next hourly temperature. Note that we only use a few days of data in order to make the plot easier to view."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:10:49.600346Z",
     "iopub.status.busy": "2020-06-23T19:10:49.598924Z",
     "iopub.status.idle": "2020-06-23T19:10:49.605543Z",
     "shell.execute_reply": "2020-06-23T19:10:49.606223Z"
    },
    "papermill": {
     "duration": 0.339705,
     "end_time": "2020-06-23T19:10:49.606457",
     "exception": false,
     "start_time": "2020-06-23T19:10:49.266752",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# plot first 7 days of the validation set, 168 hours \n",
    "start = df_val.index[0]\n",
    "end = df_val.index[167]\n",
    "sliced = baseline[start:end]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:10:50.283534Z",
     "iopub.status.busy": "2020-06-23T19:10:50.282583Z",
     "iopub.status.idle": "2020-06-23T19:10:51.038055Z",
     "shell.execute_reply": "2020-06-23T19:10:51.036719Z"
    },
    "papermill": {
     "duration": 1.113089,
     "end_time": "2020-06-23T19:10:51.038337",
     "exception": false,
     "start_time": "2020-06-23T19:10:49.925248",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
     "# Plot baseline predictions sample\n",
     "plt.rcParams['figure.dpi'] = 100  # set before creating the figure so it takes effect\n",
     "cols = ['dry_bulb_temp_f', 'dry_bulb_temp_f (t-1)', 'dry_bulb_temp_f (t-2)', 'dry_bulb_temp_f (6hr rollavg)', 'dry_bulb_temp_f (12hr rollavg)']\n",
     "sliced[cols].plot()\n",
     "\n",
     "plt.legend(['t', 't-1', 't-2', '6hr', '12hr'], loc=2, ncol=3)\n",
     "plt.title('Baselines for First 7 Days of Validation Set')\n",
     "plt.ylabel('Temperature (F)')\n",
     "plt.tight_layout()\n",
     "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.449742,
     "end_time": "2020-06-23T19:10:51.905524",
     "exception": false,
     "start_time": "2020-06-23T19:10:51.455782",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### Evaluate baseline models\n",
    "\n",
    "As you can perhaps see from the graph above, the _lagged_ baselines appear to do a better job of forecasting temperatures than the _rolling average_ baselines.\n",
    "\n",
    "In order to evaluate our baseline models more precisely, we need to answer the question _\"how well do our models predict future temperature?\"_. In regression problems involving prediction of a numerical value, we often use a measure of the difference between our predicted value and the actual value. This is referred to as an error measure or error metric. A common measure is the Mean Squared Error (MSE):\n",
    "\n",
    "\\begin{equation}\n",
    "MSE = \\frac{1}{n} \\sum_{i=1}^{n}{(y_i - \\hat y_i)^{2}}\n",
    "\\end{equation}\n",
    "\n",
    "This is the average of the squared differences between predicted values $ \\hat y $ and actual values $ y $.\n",
    "\n",
     "Because the MSE is in \"units squared\", it can be difficult to interpret, hence the Root Mean Squared Error (RMSE) is often used:\n",
    "\n",
    "\\begin{equation}\n",
    "RMSE = \\sqrt {MSE} \n",
    "\\end{equation}\n",
    "\n",
     "This is the square root of the MSE and is in the same units as the values $ y $. We can compare the RMSE (and MSE) values of different models and say that the model with the lower error is better at predicting temperatures, all else being equal. Note that MSE and RMSE grow quickly when the differences between predicted and actual values are large. This may or may not be a desirable quality of an error measure; in this case it is probably a good thing, since a model that makes large mistakes in temperature prediction will be much less useful than one that makes small ones.\n",
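     "\n",
     "As a toy check of these formulas (made-up numbers, not our weather data):\n",
     "\n",
     "```python\n",
     "import numpy as np\n",
     "\n",
     "y_true = np.array([70.0, 71.0, 69.0])\n",
     "y_pred = np.array([71.0, 70.0, 69.0])\n",
     "mse = np.mean((y_true - y_pred) ** 2)  # (1 + 1 + 0) / 3, about 0.667\n",
     "rmse = np.sqrt(mse)                    # about 0.816\n",
     "print(mse, rmse)\n",
     "```\n",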
    "\n",
    "Next, we calculate the RMSE measure for each of our baseline models, on the full validation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:10:52.661332Z",
     "iopub.status.busy": "2020-06-23T19:10:52.659720Z",
     "iopub.status.idle": "2020-06-23T19:10:52.711124Z",
     "shell.execute_reply": "2020-06-23T19:10:52.709941Z"
    },
    "papermill": {
     "duration": 0.442586,
     "end_time": "2020-06-23T19:10:52.711390",
     "exception": false,
     "start_time": "2020-06-23T19:10:52.268804",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Calculating baseline RMSE\n",
    "start_val = df_val.index[0]\n",
    "end_val = df_val.index[-1]\n",
    "baseline_val = baseline[start_val:end_val]\n",
    "\n",
    "baseline_y = baseline_val[Y_COL]\n",
    "baseline_t1 = baseline_val['dry_bulb_temp_f (t-1)']\n",
    "baseline_t2 = baseline_val['dry_bulb_temp_f (t-2)']\n",
    "baseline_avg6 = baseline_val['dry_bulb_temp_f (6hr rollavg)']\n",
    "baseline_avg12 = baseline_val['dry_bulb_temp_f (12hr rollavg)']\n",
    "\n",
     "rmse_t1 = np.sqrt(mean_squared_error(baseline_y, baseline_t1))\n",
     "rmse_t2 = np.sqrt(mean_squared_error(baseline_y, baseline_t2))\n",
     "rmse_avg6 = np.sqrt(mean_squared_error(baseline_y, baseline_avg6))\n",
     "rmse_avg12 = np.sqrt(mean_squared_error(baseline_y, baseline_avg12))\n",
    "\n",
    "print('Baseline t-1 RMSE:            {0:.3f}'.format(rmse_t1))\n",
    "print('Baseline t-2 RMSE:            {0:.3f}'.format(rmse_t2))\n",
    "print('Baseline 6hr rollavg RMSE:    {0:.3f}'.format(rmse_avg6))\n",
    "print('Baseline 12hr rollavg RMSE:   {0:.3f}'.format(rmse_avg12))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.311905,
     "end_time": "2020-06-23T19:10:53.380315",
     "exception": false,
     "start_time": "2020-06-23T19:10:53.068410",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "The RMSE results confirm what we saw in the graph above. It is clear that the _rolling average_ baselines perform poorly. In fact, the `t-2` lagged baseline is also not very good. It appears that the best baseline model is to simply use the current hour's temperature to predict the next hour's temperature!\n",
    "\n",
    "Can we do better than this simple baseline using more sophisticated models?\n",
    "\n",
    "<a id=\"cell3\"></a>\n",
    "\n",
     "### 3. Train Statistical Time-series Analysis Models\n",
    "\n",
    "\n",
    "In the previous section, we saw that a simple `lag-1` baseline model performed reasonably well at forecasting temperature for the next hourly time step. This is perhaps not too surprising, given what we know about hourly temperatures. Generally, the temperature in a given hour will be quite closely related to the temperature in the previous hour. This phenomenon is very common in time-series analysis and is known as [autocorrelation](https://en.wikipedia.org/wiki/Autocorrelation) - that is, the time series is `correlated` with previous values of itself. More precisely, the values at time `t` are correlated with lagged values (which could be `t-1`, `t-2` and so on).\n",
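     "\n",
     "Pandas gives us a quick way to measure this: `Series.autocorr(lag)` computes the correlation of a series with a lagged copy of itself. A small sketch on made-up values (the same call can be applied to our hourly temperature column):\n",
     "\n",
     "```python\n",
     "import pandas as pd\n",
     "\n",
     "s = pd.Series([60.0, 61.0, 63.0, 62.0, 60.0, 59.0])\n",
     "print(s.autocorr(lag=1))  # correlation between the series and its 1st lag\n",
     "```\n",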
    "\n",
     "Another thing we saw previously is the concept of _moving averages_. In this case the moving-average baselines were not very good at prediction; however, in many time series a moving average captures some of the underlying structure and is useful for forecasting.\n",
    "\n",
    "In order to make our model better at predicting temperature, ideally we would want to take these aspects into account. Fortunately, the statistical community has a long history of analyzing time series and has created many different forecasting models.\n",
    "\n",
    "Here, we will explore one called SARIMAX - the **S**easonal **A**uto**R**egressive **I**ntegrated **M**oving **A**verage with e**X**ogenous regressors model. \n",
    "\n",
    "This sounds like a very complex name, but if we look at the components of the name, we see that it includes `autocorrelation` (this is what auto regressive means) and `moving averages`, which are the components mentioned above. \n",
    "\n",
    "The SARIMAX model also allows including a _seasonal_ model component as well as handling *exogenous* variables, which are external to the time-series value itself. For example, for temperature prediction we may wish to take into account not just previous temperature values, but perhaps other weather features which may have an effect on temperature (such as humidity, rainfall, wind, and so on).\n",
    "\n",
    "For the purposes of this notebook, we will not explore modeling of seasonal components or exogenous variables.\n",
    "\n",
    "If we drop the \"S\" and \"X\" from the model, we are left with an [ARIMA model](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) (Auto-regressive Integrated Moving Average). This is a very commonly used model for time-series analysis and we will use it in this notebook by only specifying the relevant model components of the full SARIMAX model.\n",
    "\n",
     "#### 3.1 Replicating a baseline model\n",
    "\n",
    "As a starting point, we will see how we can use SARIMAX to create a simple model that in fact replicates one of the baselines we created previously. Auto-regression, as we have seen, means using values from preceding time periods to predict the current value. Recall that one of our baseline models was the `1st lag` or `t-1` model. In time-series analysis this is referred to as an **AR(1)** model, meaning an **A**uto-**R**egressive model for `lag 1`.\n",
    "\n",
    "Technically, the AR(1) model is not exactly the same as our baseline model. A statistical time series model like SARIMAX learns a set of `weights` to apply to each component of the model. These weights are set so as to best fit the dataset. We can think of our baseline as setting the `weight` for the `t-1` lag to be exactly `1`. In practice, our time-series model will not have a weight of exactly `1` (though it will likely be very close to that), hence the predictions will be slightly different.\n",
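     "\n",
     "Concretely, an AR(1) model (ignoring the constant term) takes the form:\n",
     "\n",
     "\\begin{equation}\n",
     "y_t = \\phi_1 y_{t-1} + \\epsilon_t\n",
     "\\end{equation}\n",
     "\n",
     "where $ \\phi_1 $ is the learned weight for the `1st lag` and $ \\epsilon_t $ is a noise term. Our `t-1` baseline corresponds to fixing $ \\phi_1 = 1 $.\n",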
    "\n",
     "Now, let's fit our model to the dataset. First, we will set up the model inputs by taking the temperature column of our dataframe. We do this for both the training and validation sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:10:54.051426Z",
     "iopub.status.busy": "2020-06-23T19:10:54.049923Z",
     "iopub.status.idle": "2020-06-23T19:10:54.054002Z",
     "shell.execute_reply": "2020-06-23T19:10:54.055130Z"
    },
    "papermill": {
     "duration": 0.363962,
     "end_time": "2020-06-23T19:10:54.055438",
     "exception": false,
     "start_time": "2020-06-23T19:10:53.691476",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "X_train = df_train[Y_COL]\n",
    "X_val = df_val[Y_COL]\n",
    "X_both = np.hstack((X_train, X_val))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.406823,
     "end_time": "2020-06-23T19:10:54.810765",
     "exception": false,
     "start_time": "2020-06-23T19:10:54.403942",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
     "Here we created a variable called `X_both` to cover both the training and validation data. This is required later when we forecast values with our SARIMAX model, in order to give the model access to all the datapoints for which it must create forecasts. Note that the forecasts themselves will only be based on the _model weights_ learned from the training data (this is important for avoiding over-fitting, as discussed above)!\n",
    "\n",
     "The SARIMAX model takes an argument called `order`, which specifies the model components and itself has 3 parts: `(p, d, q)`. `p` denotes the number of lags in the AR component, `q` the number of lags in the MA component, and `d` the order of differencing (which we will not cover here). Taken together, these specify the [ARIMA](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) portion of the SARIMAX model.\n",
    "\n",
    "To create an AR(1) model, we set the `order` to be `(1, 0, 0)`. This sets up the AR model to be a `lag 1` model. Then, we fit our model on the training data and inspect a summary of the trained model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:10:55.604704Z",
     "iopub.status.busy": "2020-06-23T19:10:55.603585Z",
     "iopub.status.idle": "2020-06-23T19:10:56.693475Z",
     "shell.execute_reply": "2020-06-23T19:10:56.696932Z"
    },
    "papermill": {
     "duration": 1.443733,
     "end_time": "2020-06-23T19:10:56.697419",
     "exception": false,
     "start_time": "2020-06-23T19:10:55.253686",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "order = (1, 0, 0)\n",
    "model_ar1 = SARIMAX(X_train, order=order)\n",
    "results_ar1 = model_ar1.fit()\n",
    "results_ar1.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.344413,
     "end_time": "2020-06-23T19:10:57.353941",
     "exception": false,
     "start_time": "2020-06-23T19:10:57.009528",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "There's quite a lot of information printed out in the model summary above. Much of it is related to the statistical properties of our model.\n",
    "\n",
     "The most important thing for now is the second table, where we see a `coef` value of `0.9996` for the weight `ar.L1`. This tells us the model has learned a weight of `0.9996` for the `1st lag` component of the AR model. This is almost `1`, hence we should expect the predictions to be very close to our `t-1` baseline.\n",
    "\n",
    "Let's create our model forecast on the validation dataset. We will then plot a few data points like we did with our baseline models (using 7 days of validation data) and compute the RMSE value based on the full validation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:10:57.993230Z",
     "iopub.status.busy": "2020-06-23T19:10:57.991479Z",
     "iopub.status.idle": "2020-06-23T19:10:58.136887Z",
     "shell.execute_reply": "2020-06-23T19:10:58.136120Z"
    },
    "papermill": {
     "duration": 0.442245,
     "end_time": "2020-06-23T19:10:58.137195",
     "exception": false,
     "start_time": "2020-06-23T19:10:57.694950",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "full_data_ar1 = SARIMAX(X_both, order=order)\n",
    "model_forecast_ar1 = full_data_ar1.filter(results_ar1.params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:10:58.833429Z",
     "iopub.status.busy": "2020-06-23T19:10:58.832088Z",
     "iopub.status.idle": "2020-06-23T19:10:59.277366Z",
     "shell.execute_reply": "2020-06-23T19:10:59.278452Z"
    },
    "papermill": {
     "duration": 0.805526,
     "end_time": "2020-06-23T19:10:59.278758",
     "exception": false,
     "start_time": "2020-06-23T19:10:58.473232",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "start = len(X_train)\n",
    "end = len(X_both)\n",
    "forecast_ar1 = model_forecast_ar1.predict(start=start, end=end - 1, dynamic=False)\n",
    "\n",
    "# plot actual vs predicted values for the same 7-day window for easier viewing\n",
    "plt.plot(sliced[Y_COL].values)\n",
    "plt.plot(forecast_ar1[:168], color='r', linestyle='--')\n",
    "plt.legend(['t', 'AR(1)'], loc=2)\n",
    "plt.title('AR(1) Model Predictions for First 7 Days of Validation Set')\n",
    "plt.ylabel('Temperature (F)')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.32596,
     "end_time": "2020-06-23T19:10:59.905046",
     "exception": false,
     "start_time": "2020-06-23T19:10:59.579086",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
     "We can see that the plot looks almost identical to the plot above, for the `t` and `t-1` baseline values.\n",
    "\n",
    "Next, we compute the RMSE values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:11:00.577864Z",
     "iopub.status.busy": "2020-06-23T19:11:00.576393Z",
     "iopub.status.idle": "2020-06-23T19:11:00.580976Z",
     "shell.execute_reply": "2020-06-23T19:11:00.578703Z"
    },
    "papermill": {
     "duration": 0.307046,
     "end_time": "2020-06-23T19:11:00.581216",
     "exception": false,
     "start_time": "2020-06-23T19:11:00.274170",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
     "# compute and print RMSE values\n",
    "rmse_ar1 = np.sqrt(mean_squared_error(baseline_val[Y_COL], forecast_ar1))\n",
    "print('AR(1) RMSE:                   {0:.3f}'.format(rmse_ar1))\n",
    "print('Baseline t-1 RMSE:            {0:.3f}'.format(rmse_t1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.271662,
     "end_time": "2020-06-23T19:11:01.146967",
     "exception": false,
     "start_time": "2020-06-23T19:11:00.875305",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
     "We can see that the RMSE values for the validation set are also almost identical.\n",
    "\n",
     "#### 3.2 Create a more complex model\n",
    "\n",
    "One of our baseline models was a `lag 2` model, i.e. `t-2`. We saw that it performed a lot worse than the `t-1` baseline. Intuitively, this makes sense, since we are throwing away a lot of information about the most recent lag `t-1`. However, the `t-2` lag still provides some useful information. In fact, for temperature prediction it's likely that the last few hours can provide some value.\n",
    "\n",
    "Fortunately, our ARIMA model framework provides an easy way to incorporate further lag information. We can construct a model that includes _both_ the `t-1` and `t-2` lags. This is an **AR(2)** model (meaning an auto-regressive model up to lag `2`). We can specify this with the model order parameter `p=2`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:11:01.736464Z",
     "iopub.status.busy": "2020-06-23T19:11:01.734470Z",
     "iopub.status.idle": "2020-06-23T19:11:02.901803Z",
     "shell.execute_reply": "2020-06-23T19:11:02.900812Z"
    },
    "papermill": {
     "duration": 1.472119,
     "end_time": "2020-06-23T19:11:02.902042",
     "exception": false,
     "start_time": "2020-06-23T19:11:01.429923",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "order = (2, 0, 0)\n",
    "model_ar2 = SARIMAX(X_train, order=order)\n",
    "results_ar2 = model_ar2.fit()\n",
    "results_ar2.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.3617,
     "end_time": "2020-06-23T19:11:03.596315",
     "exception": false,
     "start_time": "2020-06-23T19:11:03.234615",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "This time, the results table indicates a weight for variable `ar.L1` _and_ `ar.L2`. Note that the weights are now quite different from `1` (or from `0.5`, as they would be for a simple equally-weighted model). Next, we compute the RMSE on the validation set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:11:04.355345Z",
     "iopub.status.busy": "2020-06-23T19:11:04.354117Z",
     "iopub.status.idle": "2020-06-23T19:11:04.590484Z",
     "shell.execute_reply": "2020-06-23T19:11:04.589051Z"
    },
    "papermill": {
     "duration": 0.633402,
     "end_time": "2020-06-23T19:11:04.590648",
     "exception": false,
     "start_time": "2020-06-23T19:11:03.957246",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "full_data_ar2 = SARIMAX(X_both, order=order)\n",
    "model_forecast_ar2 = full_data_ar2.filter(results_ar2.params)\n",
    "\n",
    "start = len(X_train)\n",
    "end = len(X_both)\n",
    "forecast_ar2 = model_forecast_ar2.predict(start=start, end=end - 1, dynamic=False)\n",
    "\n",
    "# compute and print RMSE values\n",
    "rmse_ar2 = np.sqrt(mean_squared_error(baseline_val[Y_COL], forecast_ar2))\n",
    "print('AR(2) RMSE:                   {0:.3f}'.format(rmse_ar2))\n",
    "print('AR(1) RMSE:                   {0:.3f}'.format(rmse_ar1))\n",
    "print('Baseline t-1 RMSE:            {0:.3f}'.format(rmse_t1))"
   ]
  },
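  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional check (a sketch, reusing X_train, X_both, baseline_val and Y_COL from above):\n",
    "# sweep a few AR orders p to see how the validation RMSE changes as more lags are added.\n",
    "for p in [1, 2, 3, 4]:\n",
    "    res = SARIMAX(X_train, order=(p, 0, 0)).fit(disp=False)\n",
    "    fc = SARIMAX(X_both, order=(p, 0, 0)).filter(res.params).predict(\n",
    "        start=len(X_train), end=len(X_both) - 1, dynamic=False)\n",
    "    rmse_p = np.sqrt(mean_squared_error(baseline_val[Y_COL], fc))\n",
    "    print('AR({}) RMSE: {:.3f}'.format(p, rmse_p))"
   ]
  },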
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.31576,
     "end_time": "2020-06-23T19:11:05.274842",
     "exception": false,
     "start_time": "2020-06-23T19:11:04.959082",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We've improved the RMSE value by including information from the first two lags.\n",
    "\n",
    "In fact, you will see that if you continue to increase the `p` parameter value, the RMSE will continue to decrease (up to a point), indicating that a few recent lags provide useful information to our model.\n",
    "\n",
    "#### 2.3 Incorporate moving averages\n",
    "\n",
    "Finally, what if we also include moving average information in our model? The ARIMA framework makes this easy to do, by setting the order parameter `q`. A value of `q=1` specifies an **MA(1)** model (including the first lag `t-1`), while `q=6` would include all the lags from `t-1` to `t-6`.\n",
    "\n",
    "Note that the moving average model component is a little different from the simple moving or rolling averages computed in the baseline models. The [definition of the MA model](https://en.wikipedia.org/wiki/Moving-average_model) is rather technical, but conceptually you can think of it as using a form of weighted moving average (compared to our baseline which would be a simple, unweighted average).\n",
    "\n",
    "Let's add an MA(1) component to our AR(2) model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:11:05.922543Z",
     "iopub.status.busy": "2020-06-23T19:11:05.921432Z",
     "iopub.status.idle": "2020-06-23T19:11:08.188730Z",
     "shell.execute_reply": "2020-06-23T19:11:08.187498Z"
    },
    "papermill": {
     "duration": 2.562284,
     "end_time": "2020-06-23T19:11:08.188977",
     "exception": false,
     "start_time": "2020-06-23T19:11:05.626693",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "order = (2, 0, 1)\n",
    "model_ar2ma1 = SARIMAX(X_train, order=order)\n",
    "results_ar2ma1 = model_ar2ma1.fit()\n",
    "results_ar2ma1.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.422086,
     "end_time": "2020-06-23T19:11:08.972375",
     "exception": false,
     "start_time": "2020-06-23T19:11:08.550289",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We see that the results table shows an additional weight value for `ma.L1`, our MA(1) component. Next, we compare the RMSE to the other models and finally plot all the model forecasts together. Note that we use a much smaller 48-hour window to keep the plot readable for illustrative purposes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:11:09.576061Z",
     "iopub.status.busy": "2020-06-23T19:11:09.574047Z",
     "iopub.status.idle": "2020-06-23T19:11:09.932432Z",
     "shell.execute_reply": "2020-06-23T19:11:09.931384Z"
    },
    "papermill": {
     "duration": 0.688404,
     "end_time": "2020-06-23T19:11:09.932680",
     "exception": false,
     "start_time": "2020-06-23T19:11:09.244276",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "full_data_ar2ma1 = SARIMAX(X_both, order=order)\n",
    "model_forecast_ar2ma1 = full_data_ar2ma1.filter(results_ar2ma1.params)\n",
    "\n",
    "start = len(X_train)\n",
    "end = len(X_both)\n",
    "forecast_ar2ma1 = model_forecast_ar2ma1.predict(start=start, end=end - 1, dynamic=False)\n",
    "\n",
    "# compute and print RMSE values\n",
    "rmse_ar2ma1 = np.sqrt(mean_squared_error(baseline_val[Y_COL], forecast_ar2ma1))\n",
    "print('AR(2) MA(1) RMSE:             {0:.3f}'.format(rmse_ar2ma1))\n",
    "print('AR(2) RMSE:                   {0:.3f}'.format(rmse_ar2))\n",
    "print('AR(1) RMSE:                   {0:.3f}'.format(rmse_ar1))\n",
    "print('Baseline t-1 RMSE:            {0:.3f}'.format(rmse_t1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T19:11:10.569884Z",
     "iopub.status.busy": "2020-06-23T19:11:10.564385Z",
     "iopub.status.idle": "2020-06-23T19:11:11.060827Z",
     "shell.execute_reply": "2020-06-23T19:11:11.061781Z"
    },
    "papermill": {
     "duration": 0.80721,
     "end_time": "2020-06-23T19:11:11.061997",
     "exception": false,
     "start_time": "2020-06-23T19:11:10.254787",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# plot actual vs predicted values for a smaller 2-day window for easier viewing\n",
    "hrs = 48\n",
    "plt.plot(sliced[Y_COL][:hrs].values)\n",
    "plt.plot(forecast_ar1[:hrs], color='r', linestyle='--')\n",
    "plt.plot(forecast_ar2[:hrs], color='g', linestyle='--')\n",
    "plt.plot(forecast_ar2ma1[:hrs], color='c', linestyle='--')\n",
    "plt.legend(['t', 'AR(1)', 'AR(2)', 'AR(2) MA(1)'], loc=2, ncol=1)\n",
    "plt.title('ARIMA Model Predictions for First 48 hours of Validation Set')\n",
    "plt.ylabel('Temperature (F)')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.281868,
     "end_time": "2020-06-23T19:11:11.631851",
     "exception": false,
     "start_time": "2020-06-23T19:11:11.349983",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We've again managed to reduce the RMSE value for our model, indicating that adding the MA(1) component has improved our forecast!\n",
    "\n",
    "Congratulations! You've applied the basics of time-series analysis for forecasting hourly temperatures. See if you can further improve the RMSE values by exploring different values of the model parameters `p`, `q`, and even `d`!\n",
    "\n",
    "<a id=\"cell4\"></a>\n",
    "### Authors\n",
    "\n",
    "This notebook was created by the [Center for Open-Source Data & AI Technologies](http://codait.org).\n",
    "\n",
    "Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  },
  "papermill": {
   "duration": 105.745224,
   "end_time": "2020-06-23T19:11:12.732957",
   "environment_variables": {},
   "exception": null,
   "input_path": "Part 3 - Time Series Forecasting.ipynb",
   "output_path": "Part 3 - Time Series Forecasting-output.ipynb",
   "parameters": {},
   "start_time": "2020-06-23T19:09:26.987733",
   "version": "2.1.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
